WildFile for MS-DOS systems
A *IX SH style file globber written in C
V1.20 Dedicated to the Public Domain
January 07, 1992
J. Kercheval
[72450,3702] -- johnk@wrq.com
01-07-92
This is V1.20 of Wildfile.
Thanks to F. C. Smith for a bug fix when expanding paths of the form
c:* rather than the fully qualified c:\*. This version also utilizes
the generic match code from MATCH V1.20 with appropriate ifdef's for
use with the MSDOS path follow character.
jbk
05-13-91
This is V1.14 of Wildfile.
Thanks to David Kirschbaum of Toad Hall for adding support for Turbo C
V2.0.
jbk
05-07-91
This is V1.13 of Wildfile, a *IX SH style file Globber.
The purpose of this code is to enable the programmer to allow *real*
wildcard file specification. The UNIX (*IX) SH style wildcard system
has been around for decades and there is absolutely no good reason for
its lack of presence within MS/PC DOS and its associated tools.
I submit this without copyright and with the clear understanding that
this code may be used by anyone, for any reason, with any modifications
and without any guarantees, warranty or statements of usability of any
sort.
jbk
*IX SH style wildcard file globbing
===================================
The UNIX style of wildcard globbing (matching files to a wildcard
specification) is quite a bit more flexible than the standard
approach seen on MS/PC DOS machines. The full power of *IX SH
style regular expressions is available for specifying a file name.
For instance:
    "*t*"          would match the filenames test.doc, wet.goo,
                   itsy.bib, foo.tic, etc.
    "th?[a-eg]."   would match any file without an extension
                   whose first two letters are "th", with any third
                   letter, and whose last letter is a, b, c, d, e
                   or g (e.g. thug, thod, thud).
    "*"            would match all filenames.
The regular expression syntax is described in detail in the source
code and below.
Implementation
==============
The implementation of the wildcard package is similar in type to the
standard MS/PC DOS function calls for file searches. There is a
find_firstfile call which begins a search initially and a
find_nextfile call which continues a previous search. This approach
will normally yield a very quick port from existing *standard*
implementations of wildcard file searching.
The include file WILDFILE.H does a good job of describing the
specifics required here.
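As a rough illustration of the find-first / find-next calling pattern
described above, here is a minimal sketch. Everything in it is an
assumption made for the example: the names demo_search, demo_first,
demo_next and demo_wild are hypothetical, an in-memory list of names
stands in for a DOS directory, and the tiny matcher knows only '*' and
'?'. The real structures and signatures are the ones in WILDFILE.H.

```c
#include <stddef.h>

/* Hypothetical search state. The real structure and the real signatures
   of find_firstfile/find_nextfile are in WILDFILE.H; nothing here is
   copied from it. */
typedef struct {
    const char *const *names;   /* stand-in for a DOS directory */
    size_t count;               /* number of entries in names */
    size_t next;                /* index of the next entry to examine */
    const char *pattern;        /* wildcard specification */
} demo_search;

/* Tiny matcher for '*' and '?' only, just enough for the demo. */
int demo_wild(const char *p, const char *t)
{
    if (*p == '\0')
        return *t == '\0';
    if (*p == '*')
        return demo_wild(p + 1, t) || (*t != '\0' && demo_wild(p, t + 1));
    return *t != '\0' && (*p == '?' || *p == *t) && demo_wild(p + 1, t + 1);
}

/* Continue a search: return the next matching name, or NULL when done. */
const char *demo_next(demo_search *s)
{
    while (s->next < s->count) {
        const char *name = s->names[s->next++];
        if (demo_wild(s->pattern, name))
            return name;
    }
    return NULL;
}

/* Begin a search: record the request, then return the first match. */
const char *demo_first(demo_search *s, const char *const *names,
                       size_t count, const char *pattern)
{
    s->names = names;
    s->count = count;
    s->next = 0;
    s->pattern = pattern;
    return demo_next(s);
}
```

A caller would typically drive this with a single loop of the form
for (n = demo_first(&s, names, count, "*t*"); n != NULL; n = demo_next(&s)),
which is the same shape as a loop over the standard DOS search calls.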
WD
==
WD is a very quick implementation of a directory lister to try to
show the usage of the wildfile module as intended. The program is a
fully functional program complete with usage messages and command
line argument parsing.
Languages
=========
WILDFILE (and its associated module MATCH) were developed and
compiled using both Microsoft C V6.00A and Borland C++.
============================================================================
============================================================================
MATCH120
REGEX Globber (Wild Card Matching)
A *IX SH style pattern matcher written in C
V1.20 Dedicated to the Public Domain
January 07, 1992
J. Kercheval
[72450,3702] -- johnk@wrq.com
01-07-92
This is V1.2 of REGEX Globber.
To clarify code I have added defines for the standard characters used
within a match pattern and have reformatted the source code. There
is an internal define to allow this code to be used in filename
regular expression globbing for the MSDOS platform. This consisted
of disabling the use of the literal escape outside of a range so that
path follows (ie '\') would be handled correctly.
jbk
03-12-91
This is V1.1 of REGEX Globber.
03-12-91
I have made a few changes to the match module which do several
things. The first change is an increase in bad pattern detection
during a match. It was possible, in some very unlikely cases, to
cook up a pattern which should result in an early bad match but which
would actually cause problems for the parser. In particular, the
subcase where the literal escape '\' within an open [..] construct at
the end of a pattern would end up with incorrect results. I
proceeded to create some of these patterns, added them to my test
battery and dove straight in.
In the interim I came across a posting to CompuServe (SMATCH by Stan
Aderman) which attempted to create a completely non-recursive
implementation of match (I am not sure this is possible without
explicitly creating your own stack or its equivalent, like a binary
tree :-{ ). While the code could not correctly handle multiple '*'
characters in the pattern, there were a few interesting ideas in the
posting. On some occasions, running match over and over would be
counterproductive, particularly when you have a bad pattern. I have
added a fast routine, is_valid_pattern(), to determine if the current
pattern is well formed, which should address this situation.
One other idea which I unceremoniously lifted from SMATCH was (in
hindsight a pretty obvious feature) the return of a meaningful error
code from both the pattern validity routine and from match() (which I
renamed to matche()).
I also took some time to experiment with some ways to cut some time
off the routine. Since this is a SH pattern matcher whose intent is
primarily for shell functions, the changes could not be algorithmic
changes which relied on speedup over large input. The differences in
execution time were not very significant, but I did manage to gain
approximately 5%-10% speedup when I removed the literal escape ('\')
parsing and pattern error checking. For those of you who want to use
this for filename wildcard usage, I would recommend doing this, since
you should use is_valid_pattern and is_pattern before going out and
finding filenames, and the DOS path delimiter defaults to the
character used for the literal escape ('\') anyway (Note: I will
soon be releasing a *IX style file parser in the FINDFILE, FINDNEXT
flavor to a Public Domain archive near you :-) ).
I also briefly toyed with adding a non-SH regex character '+' to this
module but removed it again. It was a performance hit of a few
percent and would be mostly unused in any event. For those
interested in such a feature, the changes are truly minimal. The
required extra work is:
1) One case statement each in is_pattern() and is_valid_pattern()
2) One case statement in matche()
3) One addition to a while conditional in matche_after_star()
4) One addition to an if conditional in matche_after_star()
Hint: The case statements are all "case '+'" and the conditionals
have "|| *p == '+' " added to them.
I have also included a file (MATCH.DOC) which describes match's use and
background as well as a little about regular expressions.
jbk
02-24-91
This is V1.01 of REGEX Globber.
02-22-91 Seattle, WA
Hmm. Choke. (Foot in mouth). After griping about buggy routines and
literally seconds after posting this code the first time, I received
a wonderful new test evaluation tool which allows you to perform
coverage analysis during testing. Sure enough I found that about
25% of the paths in the program were never traversed in my current
test battery. After swallowing my (overly large) pride and coming
up with a test battery which covered the entire path of the program
I found a couple of minor logic bugs involving literal escapes (\)
within other patterns (ie [..] and * sequences). I have repackaged
these routines and included also the makefile I use and the test
battery I use to make things a bit easier.
jbk
02-20-91 Seattle, WA
Here is a *IX wildcard globber I butchered, hacked and cajoled together
after seeing and hearing about and becoming disgusted with several similar
routines which had one or more of the following attributes: slow, buggy,
required large levels of recursion on matches, required grotesque levels
of recursion on failing matches using '*', full of caveats about usability
or copyrights.
I submit this without copyright and with the clear understanding that
this code may be used by anyone, for any reason, with any modifications
and without any guarantees, warranty or statements of usability of any
sort.
Having gotten those cow chips out of the way, these routines are fairly
well tested and reasonably fast. I have made an effort to fail on all
bad patterns and to quickly determine failing '*' patterns. This parser
will also do quite a bit of the '*' matching via quick linear loops versus
the standard blind recursive descent.
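To give a feel for how '*' can be handled with a linear loop rather
than blind recursion, here is a minimal sketch. It is not the code
from this package: it uses the commonly seen "remember the last star"
technique, handles only '*' and '?', and the name star_match is
hypothetical.

```c
#include <stddef.h>

/* Match text t against pattern p, where '*' matches any run of
   characters (including none) and '?' matches exactly one character.
   Returns 1 on a full match, 0 otherwise. No recursion is used: only
   the most recent '*' is ever revisited. */
int star_match(const char *p, const char *t)
{
    const char *star = NULL;    /* pattern position just after the last '*' */
    const char *retry = NULL;   /* text position where that '*' stops */

    while (*t != '\0') {
        if (*p == '*') {
            star = ++p;         /* remember where to resume the pattern */
            retry = t;          /* so far the '*' has absorbed nothing */
        } else if (*p == '?' || *p == *t) {
            ++p;                /* one-for-one match, advance both */
            ++t;
        } else if (star != NULL) {
            p = star;           /* mismatch: let the last '*' absorb */
            t = ++retry;        /* one more character and try again */
        } else {
            return 0;           /* mismatch with no '*' to fall back on */
        }
    }
    while (*p == '*')           /* trailing stars match the empty string */
        ++p;
    return *p == '\0';
}
```

Only the most recent '*' needs to be remembered, because letting an
earlier '*' absorb more text can never succeed where the later one
failed. A failing match therefore costs at most about (pattern length
times text length) steps instead of growing with the number of '*'
characters.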
This parser has been submitted to profilers at various stages of development
and has come through quite well. If the last millisecond is important to
you then some time can be shaved by using stack allocated variables in
place of many of the pointer follows (which may be done fairly often) found
in regex_match and regex_match_after_star (ie *p, *t).
No attempt is made to provide general [pat,pat] comparisons. The specific
subcase supplied by these routines is [pat,text], which is sufficient
for the large majority of cases (should you care).
Since regex_match may return one of three different values depending upon
the pattern and text I have made a simple shell for convenience (match()).
Also included is an is_pattern routine to quickly check a potential pattern
for regex special characters. I even placed this all in a header file for
you lazy folks!
Having said all that, here is my own reinvention of the wheel. Please
enjoy its use and I hope it is of some help to those with need ....
jbk
*IX SH style Regular Expressions
================================
The *IX command SH is a working shell similar in feel to the MSDOS
shell COMMAND.COM. In point of fact much of what we see in our
familiar DOS PROMPT was gleaned from the early UNIX shells available
on many of the machines that people involved in the computing arena
had at the time of the development of DOS and its much maligned
precursor CP/M (although the UNIX shells were and are much more
flexible and powerful than those on the current flock of micro
machines). The designers of DOS and CP/M did some fairly strange
things with their command processor and OS. One of those things was
to only selectively adopt the regular expressions allowed within the
*IX shells. Only '?' and '*' were allowed in filenames and even with
these the '*' was allowed only at the end of a pattern and in fact
when used to specify the filename the '*' did not apply to extension.
This gave rise to the all too common expression "*.*".
REGEX Globber is a SH pattern matcher. This allows such
specifications as *75.zip or * (equivalent to *.* in DOS lingo).
Expressions such as [a-e]*t would fit the name "apple.crt" or
"catspaw.bat" or "elegant". This allows considerably wider
flexibility in file specification, general parsing or any other
circumstance in which this type of pattern matching is wanted.
A match would mean that the entire string TEXT is used up in matching
the PATTERN and conversely the matched TEXT uses up the entire
PATTERN.
In the specified pattern string:
`*' matches any sequence of characters (zero or more)
`?' matches any character
`\' suppresses syntactic significance of a special character
[SET] matches any character in the specified set,
[!SET] or [^SET] matches any character not in the specified set.
A set is composed of characters or ranges; a range looks like
'character hyphen character' (as in 0-9 or A-Z). [0-9a-zA-Z_] is the
minimal set of characters allowed in the [..] pattern construct.
Other characters are allowed (ie. 8 bit characters) if your system
will support them (it almost certainly will).
To suppress the special syntactic significance of any of `[]*?!^-\',
and match the character exactly, precede it with a `\'.
To view several examples of good and bad patterns and text see the
output of MATCHTST.BAT
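The syntax rules above can be sketched in a few dozen lines of C. This
is an illustration written for this document, not the code in MATCH.C:
the names glob_set and glob_match are hypothetical, bad patterns are
simply reported as "no match", and '*' is handled by plain recursion
for brevity rather than speed.

```c
/* Test character c against the set that starts just after '['.
   On a 0 or 1 return, *end points just past the closing ']'.
   Returns 1 if c is accepted, 0 if rejected, -1 if the set is bad. */
int glob_set(const char *p, char c, const char **end)
{
    int negate = 0, found = 0;

    if (*p == '!' || *p == '^') {        /* [!SET] or [^SET] */
        negate = 1;
        ++p;
    }
    if (*p == ']')
        return -1;                       /* empty set */
    while (*p != ']') {
        unsigned char lo, hi;

        if (*p == '\\')
            ++p;                         /* escaped set member */
        if (*p == '\0')
            return -1;                   /* no closing bracket */
        lo = hi = (unsigned char)*p++;
        if (*p == '-') {                 /* range such as a-e */
            ++p;
            if (*p == ']' || *p == '\0')
                return -1;               /* range has no end */
            if (*p == '\\' && *++p == '\0')
                return -1;               /* dangling escape */
            hi = (unsigned char)*p++;
        }
        if ((unsigned char)c >= lo && (unsigned char)c <= hi)
            found = 1;
    }
    *end = p + 1;
    return found != negate;
}

/* Return 1 if all of text t is matched by all of pattern p, else 0. */
int glob_match(const char *p, const char *t)
{
    const char *end;

    while (*p != '\0') {
        switch (*p) {
        case '*':
            while (*p == '*')
                ++p;                     /* collapse runs of '*' */
            if (*p == '\0')
                return 1;                /* trailing '*' takes the rest */
            for (; *t != '\0'; ++t)      /* try every possible split */
                if (glob_match(p, t))
                    return 1;
            return 0;                    /* rest of pattern needs text */
        case '?':
            if (*t == '\0')
                return 0;
            ++p;
            ++t;
            break;
        case '[':
            if (*t == '\0' || glob_set(p + 1, *t, &end) != 1)
                return 0;
            p = end;
            ++t;
            break;
        case '\\':
            if (*++p == '\0')
                return 0;                /* dangling escape */
            /* fall through: compare the escaped character literally */
        default:
            if (*p != *t)
                return 0;
            ++p;
            ++t;
            break;
        }
    }
    return *t == '\0';                   /* text must be used up too */
}
```

With this sketch, glob_match("[a-e]*t", "apple.crt") and
glob_match("*75.zip", "tags75.zip") both succeed, while
glob_match("a\\*b", "aXb") fails because the escaped '*' must match a
literal asterisk.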
MATCH() and MATCHE()
====================
The match module as written has two parsing routines, one is matche()
and the other is match(). Since match() is a call to matche() which
simply has its output mapped to a BOOLEAN value (ie TRUE if pattern
matches or FALSE otherwise), I will concentrate my explanations here
on matche().
The purpose of matche() is to match a pattern against a string of
text (usually a file name or specification). The match routine has
extensive pattern validity checking built into it as part of the
parser and allows for a robust pattern match.
The parser gives an error code on return of type int. The error code
will be one of the following defined values (defined in match.h):
    MATCH_PATTERN - bad or malformed pattern
    MATCH_LITERAL - match failed on character match (standard
                    character)
    MATCH_RANGE   - match failure on character range ([..] construct)
    MATCH_ABORT   - premature end of text string (pattern longer
                    than text string)
    MATCH_END     - premature end of pattern string (text longer
                    than pattern called for)
    MATCH_VALID   - valid match using pattern
The functions are declared as follows:
BOOLEAN match (char *pattern, char *text);
int matche(register char *pattern, register char *text);
IS_VALID_PATTERN() and IS_PATTERN()
===================================
There are two routines for determining properties of a pattern
string. The first, is_pattern(), is designed simply to determine if
some character exists within the text which is consistent with a SH
regular expression (this function returns TRUE if so and FALSE if
not). The second, is_valid_pattern() is designed to check the
validity of a given pattern string (TRUE return if valid, FALSE if
not). By 'validity', I mean well formed or syntactically correct.
In addition, is_valid_pattern() has as one of its parameters a
return code for determining the type of error found in the pattern if
one exists. The error codes are as follows and defined in match.h:
    PATTERN_VALID - pattern is well formed
    PATTERN_ESC   - pattern has invalid literal escape ('\' at end of
                    pattern)
    PATTERN_RANGE - [..] construct has no end range in a '-' pair
                    (i.e. [a-])
    PATTERN_CLOSE - [..] construct has no end bracket (i.e. [abc-g )
    PATTERN_EMPTY - [..] construct is empty (i.e. [])
The functions are declared as follows:
BOOLEAN is_valid_pattern (char *pattern, int *error_type);
BOOLEAN is_pattern (char *pattern);
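For readers who want to see how such checks might be structured, here
is a minimal sketch of the two routines. It is written from the
descriptions above rather than taken from MATCH.C, so treat the details
as assumptions: the sk_ names and SK_ codes are hypothetical, the
numeric values of the real codes live in match.h, and the handling of a
leading '!' or '^' inside a set is a guess.

```c
/* Hypothetical codes mirroring the names documented above; the real
   definitions (and their numeric values) live in match.h. */
enum {
    SK_PATTERN_VALID,   /* pattern is well formed */
    SK_PATTERN_ESC,     /* literal escape at end of pattern */
    SK_PATTERN_RANGE,   /* '-' with no end of range, as in [a-] */
    SK_PATTERN_CLOSE,   /* no closing bracket, as in [abc-g */
    SK_PATTERN_EMPTY    /* empty set, as in [] */
};

/* Return 1 if the string contains any wildcard syntax, 0 otherwise. */
int sk_is_pattern(const char *p)
{
    for (; *p != '\0'; ++p)
        if (*p == '*' || *p == '?' || *p == '[' || *p == '\\')
            return 1;
    return 0;
}

/* Record an error code and report failure. */
int sk_fail(int *error_type, int code)
{
    *error_type = code;
    return 0;
}

/* Return 1 if the pattern is well formed; otherwise return 0 and
   store the reason in *error_type. */
int sk_is_valid_pattern(const char *p, int *error_type)
{
    *error_type = SK_PATTERN_VALID;
    while (*p != '\0') {
        if (*p == '\\') {                      /* literal escape */
            if (*++p == '\0')
                return sk_fail(error_type, SK_PATTERN_ESC);
            ++p;
        } else if (*p == '[') {                /* [..] construct */
            ++p;
            if (*p == '!' || *p == '^')
                ++p;                           /* negation marker */
            if (*p == ']')
                return sk_fail(error_type, SK_PATTERN_EMPTY);
            while (*p != ']') {
                if (*p == '\\' && *++p == '\0')
                    return sk_fail(error_type, SK_PATTERN_ESC);
                if (*p == '\0')
                    return sk_fail(error_type, SK_PATTERN_CLOSE);
                ++p;                           /* set member */
                if (*p == '-') {
                    ++p;
                    if (*p == '\0' || *p == ']')
                        return sk_fail(error_type, SK_PATTERN_RANGE);
                    if (*p == '\\' && *++p == '\0')
                        return sk_fail(error_type, SK_PATTERN_ESC);
                    ++p;                       /* end of range */
                }
            }
            ++p;                               /* closing bracket */
        } else {
            ++p;                               /* '*', '?' or literal */
        }
    }
    return 1;
}
```

A typical caller would first use sk_is_pattern() to decide whether a
name needs wildcard expansion at all, then sk_is_valid_pattern() once
before the search, so that a bad pattern is reported a single time
instead of failing against every candidate filename.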